data boundary
Hierarchical clustering by aggregating representatives in sub-minimum-spanning-trees
Xie, Wen-Bo, Liu, Zhen, Srivastava, Jaideep
One of the main challenges in hierarchical clustering is how to appropriately identify the representative points in the lower level of the cluster tree, which are then used as the roots in the higher level of the cluster tree for further aggregation. However, conventional hierarchical clustering approaches select the "representative" points with simple heuristics, and the selected points may not be representative enough. As a result, the constructed cluster tree suffers from poor robustness and weak reliability. To address this issue, we propose a novel hierarchical clustering algorithm that, while building the clustering dendrogram, effectively detects the representative point by scoring the reciprocal nearest data points in each sub-minimum-spanning-tree. Extensive experiments on UCI datasets show that the proposed algorithm is more accurate than other benchmarks. Moreover, according to our analysis, the proposed algorithm has O(n log n) time complexity and O(log n) space complexity, indicating that it scales to massive data with low time and storage consumption.
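The abstract does not spell out the scoring rule, so the following is only a minimal sketch of the underlying notion of reciprocal nearest data points: two points that are each other's nearest neighbor. The function name `reciprocal_nearest_pairs`, the brute-force search, and the toy points are assumptions made for illustration; this is not the authors' algorithm and it does not reproduce the stated O(n log n) complexity.

```python
import math


def reciprocal_nearest_pairs(points):
    """Return index pairs (i, j), i < j, that are each other's nearest neighbor.

    Hypothetical helper: brute-force O(n^2) search using Euclidean distance.
    Requires at least two points.
    """
    def nearest(i):
        # Index of the closest other point (ties go to the lowest index).
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    # Keep a pair only when the nearest-neighbor relation holds in both directions.
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i]


# Toy data (assumed): two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
pairs = reciprocal_nearest_pairs(points)
print(pairs)
```

In this toy example the isolated point has a nearest neighbor, but the relation is not mutual, so it forms no reciprocal pair.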
Peak Criterion for Choosing Gaussian Kernel Bandwidth in Support Vector Data Description
Kakde, Deovrat, Chaudhuri, Arin, Kong, Seunghyun, Jahja, Maria, Jiang, Hansi, Silva, Jorge
Support Vector Data Description (SVDD) is a machine-learning technique used for single-class classification and outlier detection. The SVDD formulation with a kernel function provides a flexible boundary around the data. The values of the kernel function parameters affect the nature of the data boundary. For example, it is observed that with a Gaussian kernel, as the value of the kernel bandwidth is lowered, the data boundary changes from spherical to wiggly. The spherical data boundary leads to underfitting, and an extremely wiggly data boundary leads to overfitting. In this paper, we propose an empirical criterion to obtain good values of the Gaussian kernel bandwidth parameter. This criterion provides a smooth boundary that captures the essential geometric features of the data.
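One way to see why the bandwidth matters is to look at the kernel values themselves. The sketch below assumes the common parameterization K(x, y) = exp(-||x - y||^2 / (2 s^2)), which the abstract does not state explicitly; the function names and the toy points are likewise illustrative. It does not train an SVDD model and does not implement the paper's criterion. It only shows that a small bandwidth makes distinct points look almost unrelated, while a large bandwidth makes all points look nearly identical.

```python
import math


def gaussian_kernel(x, y, s):
    """Gaussian kernel with bandwidth s (assumed form: exp(-||x - y||^2 / (2 s^2)))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * s * s))


def mean_off_diagonal(points, s):
    """Average kernel value over all pairs of distinct points."""
    n = len(points)
    vals = [
        gaussian_kernel(points[i], points[j], s)
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return sum(vals) / len(vals)


# Toy data (assumed): the four corners of a unit square.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

small = mean_off_diagonal(points, 0.1)   # near 0: points look unrelated
large = mean_off_diagonal(points, 10.0)  # near 1: points look alike

print(f"s=0.1  -> mean off-diagonal kernel value {small:.3e}")
print(f"s=10.0 -> mean off-diagonal kernel value {large:.3f}")
```

When the off-diagonal values are close to zero, the description can wrap each training point separately, which corresponds to the wiggly, overfitting regime. When they are close to one, the kernel barely distinguishes the points, which corresponds to the spherical, underfitting regime.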